【Build an LLM from Scratch】Part 3: Implementing a GPT Model from Scratch to Generate Text

Overview

The big picture of building an LLM is shown below; this article covers the model architecture of the base GPT-2 family.


Coding an LLM architecture

A GPT-2 model with 124 million parameters uses the following configuration parameters:

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
  • vocab_size: the vocabulary size, i.e. the number of tokens supported by the BPE tokenizer

  • context_length: the maximum number of input tokens, i.e. the context length

  • emb_dim: the dimension of the embedding layer

  • n_heads: the number of attention heads in multi-head attention

  • n_layers: the number of transformer blocks

  • drop_rate: the dropout rate used to prevent overfitting; 0.1 means 10% of activations are dropped

  • qkv_bias: whether the Linear layers that compute the queries, keys, and values include a bias term

These parameters are used as follows when initializing the GPT model (some layers have not been introduced yet and are left as placeholders for now):

import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x

Normalizing activations with layer normalization

The LayerNorm layer normalizes the activations along a given dimension so that their mean becomes 0 and their variance becomes 1. It is computed as follows:

  • $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$: the mean

  • $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$: the variance

  • $\varepsilon$: a small constant that prevents division by zero

A slightly more flexible implementation additionally adds a learnable scale parameter that rescales each normalized value x, and a shift parameter that translates it.
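
Putting these pieces together (with the learnable scale $\gamma$ and shift $\beta$ just described), layer normalization computes:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \qquad y_i = \gamma_i \, \hat{x}_i + \beta_i$$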

A simple code implementation is shown below:

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

Note that the x values here are themselves the samples, so $\mu$ is a sample mean rather than the true mean; strictly speaking, the variance should therefore be divided by N-1 (Bessel's correction) to be unbiased. However, GPT-2 computes it this way (dividing by N), so we mimic that behavior, and since the embedding dimension N is fairly large, the difference between dividing by N and by N-1 is negligible.
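
As a quick sanity check (a minimal sketch using the LayerNorm class above), the normalized outputs should have a per-row mean of roughly 0 and a variance of roughly 1:

torch.manual_seed(123)
batch_example = torch.randn(2, 5)   # 2 samples, 5 features each
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)

# Mean should be ~0 and variance ~1 along the last dimension
print("Mean:\n", out_ln.mean(dim=-1, keepdim=True))
print("Variance:\n", out_ln.var(dim=-1, unbiased=False, keepdim=True))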

Implementing a feed forward network with GELU activations

The structure of a feed forward network is as follows:

Note that the GELU activation function is used here; its (tanh-based) approximation is:

$$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$$

Compared with ReLU, GELU is a smooth nonlinear function: it approximates ReLU but has a non-zero gradient for negative inputs (except at about x ≈ -0.75).
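
The FeedForward module below uses a GELU() class; a minimal implementation of the tanh approximation above could look like this:

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # tanh-based approximation of GELU, matching the formula above
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))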

The feed forward network can then be implemented as follows:

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
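
For example (a small sketch), the block expands the embedding dimension by a factor of 4 internally but preserves the input shape:

ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)   # [batch_size, num_tokens, emb_dim]
out = ffn(x)
print(out.shape)
# torch.Size([2, 3, 768])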

Adding shortcut connections

Shortcut (residual) connections are mainly used to mitigate the vanishing gradient problem: the output of an earlier layer is added to the input of the current layer before being passed on, as shown below:

The mechanism can be implemented as follows:

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x


def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")


layer_sizes = [3, 3, 3, 3, 3, 1]

sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)

# Without shortcut connections
# layers.0.0.weight has gradient mean of 0.00020173584925942123
# layers.1.0.weight has gradient mean of 0.00012011159560643137
# layers.2.0.weight has gradient mean of 0.0007152040489017963
# layers.3.0.weight has gradient mean of 0.0013988736318424344
# layers.4.0.weight has gradient mean of 0.005049645435065031

# With shortcut connections
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
# layers.0.0.weight has gradient mean of 0.22169792652130127
# layers.1.0.weight has gradient mean of 0.20694108307361603
# layers.2.0.weight has gradient mean of 0.3289699852466583
# layers.3.0.weight has gradient mean of 0.2665732204914093
# layers.4.0.weight has gradient mean of 1.3258541822433472

Connecting attention and linear layers in a transformer block

The structure of a transformer block is shown below:

A simple implementation:

from previous_chapters import MultiHeadAttention


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x

The benefits of this transformer block structure can be understood as follows:

  • The multi-head attention mechanism identifies and models the relationships between the elements of the sequence.

  • The feed forward network reinforces local information.

  • It applies a position-wise nonlinear transformation to each token, strengthening its independent feature representation.

  • Combined effect: together they let the model capture global patterns while also handling local details, giving it stronger capacity on complex data.

  • Example: in a sentence translation task, self-attention helps the model understand the sentence structure and the relationships between words, while the feed forward network adjusts and refines the representation of each word, yielding a more accurate translation.

Note that the input and output of the whole transformer block have the same shape. This shape-preserving design makes stacking blocks convenient: the output of one block can be fed directly as the input of the next.

For example, in the snippet below we use a batch size of 2, a context length of 4 tokens, and an embedding dimension of 768; the output has exactly the same shape as the input.

torch.manual_seed(123)

x = torch.rand(2, 4, 768) # Shape: [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

# Input shape: torch.Size([2, 4, 768])
# Output shape: torch.Size([2, 4, 768])

Coding the GPT model

An overview of the GPT model architecture is shown in the figure below.

The 124-million-parameter GPT-2 model uses 12 transformer blocks; the largest GPT-2 model, with 1.542 billion parameters, repeats the transformer block 48 times.

Stacking everything together, the implementation looks like this:

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

Note that the model's final output is, for each token position, a vector of logits over the vocabulary (its last dimension equals the vocabulary size); applying a softmax to these logits gives the probability of each token id being the next token. In other words, the model actually produces a next-token prediction for every prefix of the input context.
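
The `batch` used below is assumed to be built as in the earlier chapters, e.g. by tokenizing two short texts of equal token length with tiktoken's GPT-2 tokenizer (a sketch; the two example sentences here are illustrative inputs):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch = torch.stack([
    torch.tensor(tokenizer.encode(txt1)),
    torch.tensor(tokenizer.encode(txt2)),
], dim=0)   # Shape: [batch_size, num_tokens]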

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

# Output
Input batch:
tensor([[6109, 3626, 6100, 345],
[6109, 1110, 6622, 257]])

Output shape: torch.Size([2, 4, 50257])
tensor([[[ 0.1381, 0.0077, -0.1963, ..., -0.0222, -0.1060, 0.1717],
[ 0.3865, -0.8408, -0.6564, ..., -0.5163, 0.2369, -0.3357],
[ 0.6989, -0.1829, -0.1631, ..., 0.1472, -0.6504, -0.0056],
[-0.4290, 0.1669, -0.1258, ..., 1.1579, 0.5303, -0.5549]],

[[ 0.1094, -0.2894, -0.1467, ..., -0.0557, 0.2911, -0.2824],
[ 0.0882, -0.3552, -0.3527, ..., 1.2930, 0.0053, 0.1898],
[ 0.6091, 0.4702, -0.4094, ..., 0.7688, 0.3787, -0.1974],
[-0.0612, -0.0737, 0.4751, ..., 1.2463, -0.3834, 0.0609]]],
grad_fn=<UnsafeViewBackward0>)

Parameter count analysis

The parameters of the different GPT-2 sizes are as follows (a config-override sketch follows this list):

  • GPT2-small (the 124M configuration we already implemented):

    • “emb_dim” = 768

    • “n_layers” = 12

    • “n_heads” = 12

  • GPT2-medium:

    • “emb_dim” = 1024

    • “n_layers” = 24

    • “n_heads” = 16

  • GPT2-large:

    • “emb_dim” = 1280

    • “n_layers” = 36

    • “n_heads” = 20

  • GPT2-XL:

    • “emb_dim” = 1600

    • “n_layers” = 48

    • “n_heads” = 25
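
For reference, these variants differ from the base configuration only in the three values above, so they can be expressed as overrides of GPT_CONFIG_124M (a sketch; the dictionary keys below are just illustrative labels):

model_configs = {
    "gpt2-small":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl":     {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# For example, build a GPT-2 medium configuration from the base one
GPT_CONFIG_MEDIUM = dict(GPT_CONFIG_124M)
GPT_CONFIG_MEDIUM.update(model_configs["gpt2-medium"])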

Our code implements GPT2-small, but if we simply count all of the model's parameters we find that the total is not 124M but 163M:

model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536

This is because the original GPT-2 model uses weight tying, i.e. self.out_head.weight = self.tok_emb.weight. The tok_emb layer maps token ids to embeddings, so its weight matrix has 50,257 rows and 768 columns; out_head maps embeddings back to vocabulary-sized logits with a weight matrix of the same shape, so the two layers can share their weights.
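
As a sketch (weight tying is not applied in the GPTModel code above), the tying could be done after construction; both weight matrices have the shape (50257, 768):

print("Token embedding weight shape:", model.tok_emb.weight.shape)
print("Output head weight shape:", model.out_head.weight.shape)
# Both: torch.Size([50257, 768])

# Tie the output head to the token embedding, as in the original GPT-2
model.out_head.weight = model.tok_emb.weight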

After subtracting the parameters of the out_head layer, we indeed arrive at roughly 124M parameters:

total_params_gpt2 =  total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")
# Number of trainable parameters considering weight tying: 124,412,160
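
As a rough follow-up (assuming every parameter is stored as float32, i.e. 4 bytes), the memory footprint of the full 163M-parameter model can be estimated as follows:

total_size_bytes = total_params * 4          # 4 bytes per float32 parameter
total_size_mb = total_size_bytes / (1024 * 1024)
print(f"Total size of the model: {total_size_mb:.2f} MB")
# Total size of the model: 621.83 MB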

Generating text

The figure below shows the classic text generation loop: the model generates one new token at a time, appends it to the context, and then generates the next token.

A simple text generation implementation is shown below. We keep only the last position along the token dimension of the output, which represents the next-token prediction based on the entire preceding context; we then apply a softmax and simply take the token with the highest probability as the output.

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

Note that the model has not been trained yet, so the forward pass produces a jumble of words:

start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
# encoded: [15496, 11, 314, 716]

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)
# encoded_tensor.shape: torch.Size([1, 4])

model.eval() # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))
# Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
# Output length: 10

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
# Hello, I am Featureiman Byeswickattribute argue
