Building an LLM From Scratch, Chapter 7: Instruction Fine-Tuning

📅 2026-03-02 · 📖 ~8 min read · LLM · Fine-tuning · Instruction

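These are study notes on Chapter 7 of Build a Large Language Model From Scratch by Sebastian Raschka. The previous chapter taught GPT to classify text; this chapter goes a step further and uses instruction fine-tuning to teach the model to follow instructions. This is one of the core techniques behind conversational AI such as ChatGPT: it turns a model that can only continue text into an assistant that understands what the user wants and generates a useful answer.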

📖 Table of Contents

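Instruction fine-tuning vs. pretraining vs. classification fine-tuning
Data format: the Alpaca template
Custom collate function: handling variable-length inputs
Training and evaluation
Why this matters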


1. Instruction fine-tuning vs. pretraining vs. classification fine-tuning

So far our GPT has gone through two training modes; instruction fine-tuning adds a third:

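Training stage	Goal	Data	Output form
Pretraining (Ch5)	Learn the regularities of language	Large amounts of unlabeled text	Text continuation
Classification fine-tuning (Ch6)	Make a decision	Labeled (text, label) pairs	Class probabilities
Instruction fine-tuning (Ch7)	Follow instructions	(instruction, response) pairs	Natural-language response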

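Where Chapter 7 sits in the overall LLM-building pipeline: instruction fine-tuning is the key step that turns an LLM into an assistant (image source: LLMs from Scratch)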

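The core idea of instruction fine-tuning: show the model many "instruction → response" examples so that it learns to generate a matching response whenever it sees an instruction. The objective is still next-token prediction; what changes is the training data, which goes from plain text to structured instruction-response pairs.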

2. Data format: the Alpaca template

The book organizes its instruction data in the Alpaca format (introduced by Stanford). Each example has three fields:

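instruction: the task description (for example, "Translate this passage into French")
input: the content the task operates on (optional; some instructions need no extra input)
output: the expected response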

Python: Alpaca format and data construction

def format_input(entry):
    """Format an instruction entry using the Alpaca template."""
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Add the Input section only if the entry has extra input
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    )
    return instruction_text + input_text

# Example entry
entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'"
}

# Full training text = formatted instruction + response
model_input = format_input(entry) + f"\n\n### Response:\n{entry['output']}"

# model_input now contains:
# Below is an instruction that describes a task. Write a response ...
#
# ### Instruction:
# Identify the correct spelling of the following word.
#
# ### Input:
# Ocassion
#
# ### Response:
# The correct spelling is 'Occasion.'

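ℹ️ Why does the format matter? The model relies on the fixed markers (### Instruction:, ### Input:, ### Response:) to tell instruction from response. At inference time we supply the text only up to "### Response:" and let the model generate the rest. If the format is inconsistent between training and inference, the model may confuse where the instruction ends and the response begins.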
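The relationship between the training text and the inference prompt can be sketched as follows. This is a minimal illustration rather than the book's code: `build_prompt` and `build_training_text` are hypothetical helper names, and using `.get("input")` is an assumption that lets an entry omit the optional field entirely.

```python
HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def build_prompt(entry):
    """Text shown to the model at inference time (hypothetical helper)."""
    prompt = f"{HEADER}\n\n### Instruction:\n{entry['instruction']}"
    # The Input section is optional: skip it when the field is missing or empty.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + "\n\n### Response:"

def build_training_text(entry):
    """Full text used during fine-tuning: the prompt plus the expected answer."""
    return build_prompt(entry) + "\n" + entry["output"]

entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

prompt = build_prompt(entry)
training_text = build_training_text(entry)

# The training text is the inference prompt followed by the answer,
# so the model sees the same prefix in both settings.
print(training_text.startswith(prompt))
print(training_text[len(prompt):].strip())
```

Because both strings come from the same function, the markers cannot drift apart between training and inference.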

3. Custom collate function: handling variable-length inputs

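Instruction data poses a particular challenge: the instruction-response pairs vary widely in length. The book handles this with a custom collate_fn that does two key things: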

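Dynamic padding: pad the sequences in a batch to the length of that batch's longest sequence (rather than a global maximum), which saves computation
Ignoring the loss at padding positions: set the targets of padding tokens to -100 (the default ignore_index of PyTorch's cross_entropy), except for the first one, which is kept as the end-of-text target, so the model does not learn to predict padding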

Python: custom collate_fn

import torch

def custom_collate_fn(batch, pad_token_id=50256,
                      ignore_index=-100, device="cpu"):
    # Longest sequence in this batch, plus one for the end-of-text token
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]   # append the end-of-text token

        # Pad to the longest length in this batch
        padded = (
            new_item + [pad_token_id] * (batch_max_length - len(new_item))
        )
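        inputs = torch.tensor(padded[:-1])    # inputs: drop the last token
        targets = torch.tensor(padded[1:])    # targets: drop the first token (shift by one)

        # Key step: keep the first pad token as the end-of-text target and
        # set every later padding target to ignore_index so the loss skips it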
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor
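To see what the function above produces, it can help to trace a toy batch by hand. The sketch below is a pure-Python mirror of the same padding and masking steps, written with plain lists so it runs without PyTorch; `collate_lists` and the toy token ids are illustrative assumptions, not part of the book's code.

```python
PAD_ID = 50256       # GPT-2's <|endoftext|> id, reused as the padding token
IGNORE_INDEX = -100  # target value that the loss skips

def collate_lists(batch):
    """Pure-Python mirror of the padding and masking steps (no tensors)."""
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        # One end-of-text token plus padding up to the batch maximum.
        padded = item + [PAD_ID] * (batch_max_length - len(item))
        inputs = padded[:-1]   # drop the last token
        targets = padded[1:]   # drop the first token (shift by one)
        # Keep the first pad token as the end-of-text target;
        # replace every later one with IGNORE_INDEX.
        seen_pad = False
        for i, tok in enumerate(targets):
            if tok == PAD_ID:
                if seen_pad:
                    targets[i] = IGNORE_INDEX
                seen_pad = True
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return inputs_lst, targets_lst

# Three sequences of different lengths (toy token ids).
batch = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
inputs, targets = collate_lists(batch)

for row in inputs:
    print(row)
for row in targets:
    print(row)
```

Every row ends up with length 5, the longest sequence plus the end-of-text token minus the one-token shift. In the targets, each row keeps exactly one 50256 and the remaining padding positions become -100.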

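⚠️ What ignore_index=-100 does: PyTorch's cross_entropy(input, target, ignore_index=-100) skips every position whose target is -100. Padding tokens therefore produce no gradient, and the model wastes no capacity on learning when to emit padding. The trick is very common in sequence-to-sequence tasks.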
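The effect on the loss value can be checked with a small calculation. The sketch below re-implements the averaging rule in pure Python, assuming the default mean reduction, which averages over non-ignored positions only. `mean_cross_entropy` is a hypothetical stand-in for illustration, not PyTorch's implementation, and the logits are toy numbers.

```python
import math

IGNORE_INDEX = -100

def mean_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions that are not ignored.

    Assumes at least one target differs from ignore_index.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # skipped positions add neither loss nor gradient
        # Negative log-softmax of the target class, computed stably.
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

# Three positions, two classes each (toy logits).
logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two_positions = mean_cross_entropy(logits[:2], [0, 1])
loss_with_ignored = mean_cross_entropy(logits, [0, 1, IGNORE_INDEX])
loss_all_counted = mean_cross_entropy(logits, [0, 1, 1])

# Ignoring the third position gives the same loss as leaving it out.
print(loss_two_positions == loss_with_ignored)
# Counting it changes the average.
print(round(loss_with_ignored, 4), round(loss_all_counted, 4))
```

An ignored position behaves as if it were not in the batch at all, which is why padding neither raises nor lowers the reported loss.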

4. Training and evaluation

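The book uses GPT-2 Medium (355M) as the base model and fine-tunes it on a dataset of 1,100 instruction examples. Training reuses the train_model_simple function from Chapter 5: the training code for instruction fine-tuning is almost identical to pretraining, and the only difference is the data format.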

Training process and loss curves for instruction fine-tuning (image source: LLMs from Scratch)

After training, prompt the model with a formatted instruction and take the text after ### Response: as the answer:

Python: generating a response

def generate_response(model, instruction_entry, tokenizer, device):
    # Build the input: formatted instruction + "### Response:" prompt
    input_text = format_input(instruction_entry) + "\n\n### Response:"
    input_ids = tokenizer.encode(input_text)
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)

    # Generate text
    token_ids = generate(
        model, input_tensor,
        max_new_tokens=256,
        context_size=model.pos_emb.weight.shape[0],
        eos_id=50256  # stop at <|endoftext|>
    )

    # Extract the response part
    response = tokenizer.decode(token_ids.squeeze(0).tolist())
    response_text = response.split("### Response:")[-1].strip()
    return response_text

# Example
entry = {"instruction": "What is a palindrome?", "input": ""}
print(generate_response(model, entry, tokenizer, device))
# → "A palindrome is a word, phrase, or sequence that reads the
#    same forwards and backwards. Examples include 'racecar'..."
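Splitting on the marker keeps only the text after its last occurrence, which can truncate an answer that itself mentions "### Response:". One possible variation is to slice off the prompt by length when the generated text echoes it verbatim. This is a sketch under that assumption; `extract_response` is a hypothetical helper and the strings are made up for illustration.

```python
def extract_response(generated_text, prompt_text, eos_text="<|endoftext|>"):
    """Strip the echoed prompt and any end-of-text marker from generated text."""
    if generated_text.startswith(prompt_text):
        # Usual case: the output begins with the exact prompt we supplied.
        answer = generated_text[len(prompt_text):]
    else:
        # Fallback: split on the marker, as in the function above.
        answer = generated_text.split("### Response:")[-1]
    # Drop anything from the end-of-text marker onward.
    return answer.split(eos_text)[0].strip()

prompt = "### Instruction:\nWhat header starts the answer?\n\n### Response:"
generated = prompt + "\nThe header '### Response:' starts the answer.<|endoftext|>"

by_length = extract_response(generated, prompt)
by_split = generated.split("### Response:")[-1].strip()

print(by_length)
print(by_split)
```

The length-based version returns the whole sentence, while the split-based version returns only the fragment after the marker inside the answer.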

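The book also uses Ollama with Llama 3 to evaluate generation quality automatically: a stronger model scores the fine-tuned GPT-2's responses from 0 to 100. The average score is about 49.45, which is reasonable for a 355M-parameter model but also shows the limits of small models at instruction following.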
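The scoring loop around the judge model can be sketched without calling any model. The code below covers only the two ends of the process: building a scoring prompt and turning the judge's free-text replies into an average. The prompt wording and the names `build_judge_prompt`, `parse_score`, and `average_score` are assumptions for illustration. No Ollama call is made, and `dummy_replies` are made-up placeholders, not real judge output.

```python
import re

def build_judge_prompt(instruction_text, reference_output, model_response):
    """Compose a scoring request for the judge model (illustrative wording)."""
    return (
        f"Given the input `{instruction_text}` "
        f"and the correct output `{reference_output}`, "
        f"score the model response `{model_response}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
        f"Respond with the integer number only."
    )

def parse_score(judge_reply):
    """Return the first integer in 0..100 found in the reply, or None."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None

def average_score(judge_replies):
    """Average the parseable scores; replies without a valid score are skipped."""
    scores = [s for s in map(parse_score, judge_replies) if s is not None]
    return sum(scores) / len(scores) if scores else None

# Placeholder replies standing in for what a judge model might return.
dummy_replies = ["85", "Score: 40", "I cannot rate this."]
print(average_score(dummy_replies))
```

Skipping unparseable replies is a design choice: it keeps one malformed answer from breaking the run, but the average then covers fewer examples, so it is worth reporting how many replies were dropped.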

5. Why this matters

Instruction fine-tuning is the key shift from "language model" to "AI assistant." A few core takeaways:

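The foundation of ChatGPT: ChatGPT is essentially a GPT model that has been instruction fine-tuned (plus RLHF). Understanding this chapter's workflow means understanding the technical skeleton of conversational AI products: pretraining → SFT (Supervised Fine-Tuning) → RLHF.
Data quality > data quantity: instruction fine-tuning does not need massive data. The LIMA paper showed that as few as 1,000 high-quality instruction examples can produce a marked effect. What matters is the diversity and quality of the data, not its volume.
Format as constraint: the Alpaca template looks simple, but it defines the model's "protocol": the model learns to generate a response after it sees "### Response:". Modern LLMs use more elaborate chat templates (such as ChatML), but the principle is the same.
The engineering value of ignore_index: in practical instruction fine-tuning, the instruction part is often masked as well (loss is computed only on the response), so the model focuses on learning how to answer rather than memorizing the instruction format.
The challenge of evaluation: automatically evaluating generation quality is far harder than measuring classification accuracy. The book's LLM-as-judge approach (having a strong model assign scores) is common practice but imperfect. Evaluating generative tasks remains an open problem.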
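The instruction-masking idea from the list above can be sketched on plain token lists. This is a minimal illustration, assuming the number of tokens in the formatted prompt is known; `mask_instruction_targets` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100

def mask_instruction_targets(targets, instruction_length, ignore_index=IGNORE_INDEX):
    """Return a copy of targets in which instruction tokens are ignored by the loss."""
    masked = list(targets)
    # Targets are the sequence shifted left by one token, so only the first
    # instruction_length - 1 targets are still instruction tokens.
    for i in range(min(instruction_length - 1, len(masked))):
        masked[i] = ignore_index
    return masked

# Toy sequence: three instruction tokens, two response tokens, end-of-text.
tokens = [11, 12, 13, 21, 22, 50256]
instruction_length = 3

inputs = tokens[:-1]
targets = tokens[1:]
masked_targets = mask_instruction_targets(targets, instruction_length)

print(inputs)
print(masked_targets)
```

The last instruction token stays in the inputs and its target is the first response token, so the model is still trained on how to start the answer.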

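ℹ️ Summary of the Chapter 7 instruction fine-tuning workflow: prepare data in (instruction, input, output) form → format it with the Alpaca template → use a custom collate_fn to handle variable lengths and padding → continue training a pretrained GPT-2 (still next-token prediction) → at inference, supply the instruction plus the "### Response:" prompt and let the model complete the answer. That is the full process of turning GPT from a text-continuation machine into an instruction-following assistant.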

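These are study notes on Build a Large Language Model From Scratch (Sebastian Raschka). All figures are copyright of the original author. The code is based on the book's examples, simplified and with added comments.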